True Online TD(λ)

Authors

  • Harm van Seijen
  • Richard S. Sutton
Abstract

TD(λ) is a core algorithm of modern reinforcement learning. Its appeal comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. However, the equivalence between TD(λ) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(λ) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate. In a sense this is unavoidable for the conventional forward view, as it itself presumes that the estimates are unchanging during an episode. In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(λ) that exactly achieves it. Our algorithm uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces. The overall computational complexity is the same as TD(λ), even when using function approximation. In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems, by adhering more truly to the original goal of TD(λ)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(λ).

1. Why True Online TD(λ) Matters

Temporal-difference (TD) learning is a core learning technique in modern reinforcement learning (Sutton, 1988; Kaelbling, Littman & Moore, 1996; Sutton & Barto, 1998; Szepesvári, 2014). One of the main challenges in reinforcement learning is to make predictions, in an initially unknown environment, about the (discounted) sum of future rewards, the return, based on currently observed features and a certain behavior policy. With TD learning it is possible to learn good estimates of the expected return quickly by bootstrapping from other expected-return estimates. TD(λ) is a popular TD algorithm that combines basic TD learning with eligibility traces to further speed learning. The popularity of TD(λ) can be explained by its simple implementation, its low computational complexity, and its conceptually straightforward interpretation, given by its forward view. The forward view of TD(λ) (Sutton & Barto, 1998) is that the estimate at each time step is moved toward an update target known as the λ-return; the corresponding algorithm is known as the λ-return algorithm. The λ-return is an estimate of the expected return based on both subsequent rewards and the expected-return estimates at subsequent states, with λ determining the precise way these are combined.

The forward view is useful primarily for understanding the algorithm theoretically and intuitively. It is not clear how the λ-return could form the basis for an online algorithm, in which estimates are updated on every step, because the λ-return is generally not known until the end of the episode. Much clearer is the off-line case, in which values are updated only at the end of an episode, summing the value corrections corresponding to all the time steps of the episode. For this case, the overall update of the λ-return algorithm has been proven equal to the off-line version of TD(λ), in which the updates are computed on each step, but not used to actually change the value estimates until the end of the episode, when they are summed to produce an overall update (Sutton & Barto, 1998). The overall per-episode updates of the λ-return algorithm and of off-line TD(λ) are the same, even though the updates on each time step are different; by the end of the episode they must sum to the same overall update. One of TD(λ)'s most appealing features is that at each step it only requires temporally local information—the immediate next reward and next state; this enables the algorithm to be applied online. The online updates will be slightly different...
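
To make the contrast concrete, here is a minimal sketch, not the authors' pseudocode, of one update step of conventional online TD(λ) with an accumulating trace next to the true online TD(λ) update with its dutch-style trace, assuming linear function approximation V(s) = θ·φ(s) and the standard λ-return G_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_t^{(n)}, a geometrically weighted mixture of the n-step returns (Sutton & Barto, 1998). The NumPy formulation and all function and argument names are illustrative choices.

    import numpy as np

    def td_lambda_step(theta, e, phi, phi_next, reward, alpha, gamma, lam):
        # Conventional online TD(lambda): accumulating eligibility trace.
        delta = reward + gamma * theta.dot(phi_next) - theta.dot(phi)
        e = gamma * lam * e + phi
        theta = theta + alpha * delta * e
        return theta, e

    def true_online_td_lambda_step(theta, e, v_old, phi, phi_next, reward,
                                   alpha, gamma, lam):
        # True online TD(lambda): dutch-style trace plus a correction term
        # for the change in the current state's value estimate since the
        # previous step (v - v_old).
        v = theta.dot(phi)
        v_next = theta.dot(phi_next)
        delta = reward + gamma * v_next - v
        e = gamma * lam * e + (1.0 - alpha * gamma * lam * e.dot(phi)) * phi
        theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * phi
        return theta, e, v_next  # v_next is next step's v_old

In this sketch, e is reset to the zero vector and v_old to 0.0 at the start of each episode; after each step the caller advances phi to phi_next and stores the returned v_next as the new v_old.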

Similar Articles

An Empirical Evaluation of True Online TD(λ)

The true online TD(λ) algorithm has recently been proposed (van Seijen and Sutton, 2014) as a universal replacement for the popular TD(λ) algorithm, in temporal-difference learning and reinforcement learning. True online TD(λ) has better theoretical properties than conventional TD(λ), and the expectation is that it also results in faster learning. In this paper, we put this hypothesis to the te...

True Online Emphatic TD(λ): Quick Reference and Implementation Guide

TD(λ) is the core temporal-difference algorithm for learning general state-value functions (Sutton 1988, Singh & Sutton 1996). True online TD(λ) is an improved version incorporating dutch traces (van Seijen & Sutton 2014, van Seijen, Mahmood, Pilarski & Sutton 2015). Emphatic TD(λ) is another variant that includes an “emphasis algorithm” that makes it sound for off-policy learning (Sutton, Mahm...

True Online Emphatic TD(λ): Quick Reference and Implementation Guide

This document is a guide to the implementation of true online emphatic TD(λ), a model-free temporal-difference algorithm for learning to make long-term predictions which combines the emphasis idea (Sutton, Mahmood & White 2015) and the true-online idea (van Seijen & Sutton 2014). The setting used here includes linear function approximation, the possibility of off-policy training, and all the ge...

True Online Temporal-Difference Learning

The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively (van Seijen and Sutton, 2014). Algorithm...

Implicit Temporal Differences

In reinforcement learning, the TD(λ) algorithm is a fundamental policy evaluation method with an efficient online implementation that is suitable for large-scale problems. One practical drawback of TD(λ) is its sensitivity to the choice of the step-size. It is an empirically well-known fact that a large step-size leads to fast convergence, at the cost of higher variance and risk of instability....

Publication date: 2014